Optimizing the performance of Lattice Gauge Theory simulations with Streaming SIMD extensions
Abstract
Two factors that affect simulation quality are the amount of computing power available and the implementation. The Streaming SIMD (single instruction, multiple data) Extensions (SSE) offer a way to influence both by exploiting the processor's parallel functionality. In this paper, we show how SSE improves the performance of lattice gauge theory simulations. We identified two significant trends through an analysis of data from various runs. First, the speed-ups were higher for single-precision than for double-precision floating-point numbers. Second, although the use of SSE significantly improved simulation time, it did not deliver the theoretical maximum speed-up. There are a number of reasons for this: architectural constraints imposed by the front-side bus (FSB) speed, the spatial and temporal patterns of data retrieval, the ratio of computational to non-computational instructions, and the need to interleave miscellaneous instructions with computational instructions. We present a model for analyzing SSE performance that could help factor in the bottlenecks or weaknesses in the implementation, the computing architecture, and the mapping of software to the computing substrate when evaluating the improvement in efficiency. The model could also be useful in evaluating other computational frameworks and in predicting the benefits that can be derived from future hardware or architectural improvements.
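The two trends can be illustrated with a rough back-of-the-envelope sketch. This is not the model presented in the paper: it is a generic Amdahl-style estimate, and the function names `simd_lanes` and `bounded_speedup` and the 0.8 vectorizable fraction are assumptions chosen only for illustration. It relies on one known fact, that an SSE register is 128 bits wide and therefore holds four 32-bit floats or two 64-bit doubles.

```python
# Illustrative Amdahl-style estimate; NOT the performance model from the paper.
SSE_REGISTER_BITS = 128  # width of one SSE (XMM) register


def simd_lanes(element_bits, register_bits=SSE_REGISTER_BITS):
    """Number of elements of the given size that fit in one SIMD register."""
    return register_bits // element_bits


def bounded_speedup(lanes, vector_fraction):
    """Estimated speed-up when only `vector_fraction` of the scalar run time
    is accelerated by a factor of `lanes` and the rest is unchanged."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


if __name__ == "__main__":
    for name, bits in (("single", 32), ("double", 64)):
        lanes = simd_lanes(bits)
        # 0.8 is a hypothetical fraction, not a measured value
        print(name, lanes, round(bounded_speedup(lanes, 0.8), 2))
```

Under these assumptions the ideal ceiling is 4x for single precision and 2x for double precision, and any non-vectorized share of the run time (memory traffic, bookkeeping instructions) pulls the achievable figure below that ceiling.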
Similar resources
Performance of SSE and AVX Instruction Sets
SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured by Intel and AMD. This SIMD programming allows parallel processing by multiple cores in a single CPU. Basic arithmetic and data transfer operations such as sum, multiplication and square root can be processed simultaneous...
Lattice QCD Calculations on Commodity Clusters at DESY
Lattice Gauge Theory is an integral part of particle physics that requires high performance computing in the multi-Tflops regime. These requirements are motivated by the rich research program and the physics milestones to be reached by the lattice community. Over the last years the enormous gains in processor performance, memory bandwidth, and external I/O bandwidth for parallel applications ha...
A Performance Evaluation Of Multimedia Kernels Using AltiVec Streaming SIMD Extensions
This paper aims to provide an understanding of performance of multimedia applications that use floating-point computations on recent general-purpose microprocessors using streaming SIMD ISA extensions. We used 8 benchmarks to study the impact of these extensions on general application performance and identify the eventual bottlenecks introduced.
Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension
The discrete Fourier transform (DFT) and its fast algorithms (fast Fourier transforms or FFTs) are among the most important computational building blocks in signal processing and scientific computing. Consequently, there are a number of high performance DFT libraries available including Intel’s Integrated Performance Primitives (IPP), FFTW [6], and libraries generated by Spiral [9, ...
Single Instruction Multiple Data – Not Everything is a Nail for this Hammer
Hardware vendors have been struggling to fight the power and memory wall for decades [1, 2]. Since most of the processing time depends on the number of instructions, number of used registers and dependencies between instructions, but not on the size of a register, independent data items of a vector (i.e., a column) could be processed in parallel. Hence, a silver lining seems to be Single Instru...
Journal title:
- CoRR
Volume abs/1309.0551
Publication date: 2013